fix(groupby): group by non-dimension coordinate names; fast multi-key grouping by names (#750, #753)#751
Open
FBumann wants to merge 9 commits into
Open
fix(groupby): group by non-dimension coordinate names; fast multi-key grouping by names (#750, #753)#751FBumann wants to merge 9 commits into
FBumann wants to merge 9 commits into
Conversation
`LinearExpression.groupby` could not group by an attached non-dimension
coordinate. `expr.groupby("period").sum()` raised `ValueError: period
already exists as coordinate or variable name`, and passing the coordinate
`DataArray` (`groupby(period)`) raised because the fast path dropped only
the dimension index, then renamed the group dim onto a name still held by
the attached coordinate.
Fix both paths:
- `sum()` resolves a string group naming an existing coordinate to that
coordinate so it takes the fast path, and drops every coordinate aligned
to the grouped dimension (index, MultiIndex levels, auxiliary coords)
before reshaping, since collapsing the dimension invalidates them all.
- The `groupby` property detaches an attached non-dimension coordinate used
as the group before handing it to xarray, so xarray does not try to
re-expand it when recombining groups (the `use_fallback=True` path).
`expr.groupby("period").sum()` now mirrors `xarray.Dataset.groupby`.
Closes #750
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reorganize the #750 coverage into TestGroupbyByAttachedCoordinate, a parametrized matrix that asserts the grouped expression against hard-coded `vars`/`coeffs` literals (on a deterministic 4- and 8-variable model) instead of comparing to a sibling computation that could share the same bug. Covers single-key (name / DataArray x use_fallback), multi-key (list / tuple x use_fallback), an extra auxiliary coord on the grouped dimension, and a 2-D variable that must keep its other dimension. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tract A single-element key list (`groupby(["period"])`) now groups like the scalar key, matching xarray -- it is unwrapped in both `sum()` and the `groupby` property. Multi-key grouping must be spelled with names (`["period", "season"]`); a list of `DataArray`s is unhashable and raises in xarray itself, so linopy mirrors that (covered by an explicit `pytest.raises` row in the matrix). Also shorten the test class docstring to house style. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The GH #750 detach must only drop *free* (non-indexed) coordinates. The earlier change also dropped a MultiIndex level when grouping by it via `use_fallback=True`, leaving the dimension without an index (`('snapshot',) are not coordinates with an index`). Guard the detach with `group.name not in data.xindexes` so MultiIndex levels are left intact. Grouping by a MultiIndex level now works on both paths (the pydata/xarray 6836 case, fixed upstream). Add a parametrized regression test over both levels and both paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Extract the single-element-list unwrap + coordinate-name resolution shared by `sum()` and the `groupby` property into one `_resolve_group` helper, removing the duplication (and the drift between the two that caused the earlier MultiIndex-level regression). - Drop the now-stale `(GH #750)` references from code comments; the link lives in the release notes. - Add a test for grouping by a dimension coordinate name (the fast-path broadening), and note it in the release note. - Simplify `test_multi_key`: a multi-key group always uses the xarray fallback, so drop the redundant `use_fallback` parametrization. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Grouping by a dimension coordinate or a MultiIndex level by name already worked; only the non-dimension (free) coordinate case was broken. Correct the release note, which had overclaimed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the verbose comment blocks; the rationale lives in the _resolve_group docstring and the regression tests enforce the invariants (e.g. test_multiindex_level guards the xindexes detach guard). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 4, 2026
63e82ef to
ae23a3c
Compare
63e82ef to
de067dc
Compare
`groupby(["a","b"]).sum()` previously dropped to the slow xarray fallback. Resolve a list of coordinate names (1-D, same dim) to a value frame so it rides the existing reindex fast path, then unstack the stacked result back into one dimension per key -- byte-identical to the fallback, sparse fill cells included. The DataFrame grouper is untouched and stays compact (stacked MultiIndex over observed combinations only), so this is non-breaking. One dimension per key is a dense cartesian grid, so a sparse key crossing materialises mostly-fill cells. Warn (pointing at the DataFrame grouper) when the grid is much larger than the observed combinations; the check reads the collapsed MultiIndex levels, so it is O(observed) and fires before unstack allocates. See #753; sparse-representation follow-ups tracked against #740. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
de067dc to
3cc774f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Brings
LinearExpression.groupbyto parity withxarray.Dataset.groupbyfor grouping by coordinate names — on the fast path, with no breaking change. Two related pieces:1. Group by an attached non-dimension coordinate (closes #750).
Both now work and take the fast path, mirroring
xarray. (Grouping by a dimension name always worked — only attached non-dim coordinates were broken.) The old workaround was to detach the coord first (expr.drop_vars("period").groupby(period).sum()).2. Multi-key grouping by names now takes the fast path (closes #753).
It now rides the reindex fast path and returns the same output — one dimension per key, byte-identical to the fallback, sparse fill cells included. The
pd.DataFramegrouper is untouched and keeps its compact stacked-MultiIndexoutput, so this is non-breaking.How
In
LinearExpressionGroupby:_resolve_groupnormalizes a key: unwrap a single-element list (groupby(["period"])→ scalar), and resolve a string coord name to its coordinate so it takes the fast path.sum()drops every coordinate aligned to the grouped dimension before reshaping, so an attached aux coord (including the one being grouped by) no longer collides on the finalrename.groupbyproperty detaches a free (non-indexed) coordinate before handing it to xarray — fixing theuse_fallback=Truepath — but never aMultiIndexlevel.unstacked back into one dimension per key.Memory
One dimension per key is a dense cartesian grid, so a sparse key crossing materialises mostly-fill cells (measured ~100× vs the compact
DataFramegrouper for a diagonal crossing — see #740). AUserWarningnudges sparse/high-cardinality users to theDataFramegrouper; it reads the collapsedMultiIndexlevels, so it is O(observed), not O(N), and fires beforeunstackallocates the grid. Getting separate dims and compact storage would need a sparse/long-format kernel — tracked in #757 (the groupby-densification follow-up) under the sparse-kernel umbrella #756.Tests
TestGroupbyByAttachedCoordinateasserts groupedvars/coeffsagainst hard-coded results on a deterministic model: single-key (name/DataArray×use_fallback), 2-D variable, dimension-coordinate-by-name, single-element list,MultiIndexlevel, and apytest.raisesrow pinning that a list ofDataArrays is unsupported.TestMultiKeyFastPathcovers the multi-key fast path: fast == fallback (list/tuple, including a sparse crossing), separate-dims-not-stacked, sparse-combination-filled, DataFrame-grouper-stays-compact, and the blow-up warning (fires when sparse, silent when dense). Full suite green; ruff + mypy clean.Context: relation to PyPSA usage and #744
This is a building block for the flat-dimension + auxiliary-level-coord direction discussed in #744. If
n.snapshotsbecomes a flat dimension carryingperiod/scenariolevel coordinates (instead of a stackedpd.MultiIndex), aggregating an expression over a level becomesexpr.groupby("period").sum()— exactly what this PR makes work, now also for multiple levels at once (groupby(["period", "scenario"])).PyPSA still carries per-period workarounds citing a broken MultiIndex groupby (pydata/xarray#6836); that upstream bug is now fixed (verified on xarray 2025.9.0), but those comments are about MultiIndex grouping and per-period rolling, orthogonal to this PR.
Closes #750, #753. Drafted with an agent (Claude Code).
🤖 Generated with Claude Code